(57)



ABSTRACT 



A search engine system assists users in locating web pages 
from which user-specified products can be purchased. Web 
pages located by a crawler program are scored, based on a 
set of rules, according to likelihood of including an online 
product offering. A query server accesses an index of the 
scored web pages to locate pages that are both responsive to 
a user's search query and likely to include a product
offering. In one embodiment, the responsive web pages are listed
on a composite search results page together with products 
that satisfy the query. 

28 Claims, 9 Drawing Sheets 
2. Second Analysis Stage

In practice, the vast majority of the web pages on the
World Wide Web are not associated with product offerings,
and as such their corresponding product scores are low. As
shown in FIG. 5, these web pages are excluded from the
Product Spider database 147 by a filtering step 530. The
filter is simply a threshold number, preferably thirty, that the
web page product score must equal or exceed to satisfy the
filter. Web pages having a product score below thirty are
discarded 532 as inappropriate for the Product Spider
database 147. Typically about 99% of all web pages in the World
Wide Web are discarded in this manner. Those pages having
product scores satisfying the filter criteria are retained. The
corresponding URLs are submitted back 540 to the web
crawler 160 for a second crawling stage 560.
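The threshold filter described above can be illustrated with a minimal Python sketch. The function name, the dictionary layout, and the example URLs are assumptions made for illustration; only the threshold of thirty and the "equal or exceed" rule come from the text.

```python
from typing import Dict, List

# "Preferably thirty" per the description of filtering step 530.
PRODUCT_SCORE_THRESHOLD = 30


def filter_pages(scores: Dict[str, float],
                 threshold: float = PRODUCT_SCORE_THRESHOLD) -> List[str]:
    """Return the URLs whose product score equals or exceeds the threshold.

    `scores` maps a URL to its product score (hypothetical layout).
    """
    return [url for url, score in scores.items() if score >= threshold]


# Hypothetical scores, for illustration only.
example_scores = {
    "http://example.com/store/widget": 72,
    "http://example.com/blog/post": 4,
    "http://example.com/catalog": 30,
}
retained = filter_pages(example_scores)  # URLs to resubmit to the crawler
```

A page scoring exactly thirty is retained, since the score must only equal or exceed the threshold.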

In other embodiments, such as those in which the index is
also used to provide a general purpose web search engine,
pages may be indexed without regard to their respective
product scores. In still other embodiments, the filter
comprises multiple ranges of product score values with
predetermined minimum and maximum values. For example, four
separate databases may be created for web pages having
product score values of 20-40, 40-60, 60-80, and 80-100,
respectively. In these latter embodiments the product scores
may optionally be omitted from the respective databases.
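The multi-range variant might be sketched as follows. The text does not say which database receives a page whose score falls exactly on a shared boundary (40, 60, or 80), so this sketch assumes each range includes its lower bound and excludes its upper bound, except the last range, which includes 100. All names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Ranges taken from the text; the boundary handling below is an assumption.
SCORE_RANGES: List[Tuple[int, int]] = [(20, 40), (40, 60), (60, 80), (80, 100)]


def assign_range(score: float) -> Optional[Tuple[int, int]]:
    """Return the range a score falls into, or None if it fits no range."""
    for low, high in SCORE_RANGES:
        is_last = (low, high) == SCORE_RANGES[-1]
        if low <= score < high or (is_last and score == high):
            return (low, high)
    return None


def partition_by_range(scores: Dict[str, float]) -> Dict[Tuple[int, int], List[str]]:
    """Group URLs into one list per range; the score itself is not stored."""
    tables: Dict[Tuple[int, int], List[str]] = {r: [] for r in SCORE_RANGES}
    for url, score in scores.items():
        bucket = assign_range(score)
        if bucket is not None:
            tables[bucket].append(url)
    return tables
```

Because each table already implies a score range, the individual scores can be dropped, which matches the option of omitting them from the respective databases.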

If the Product Spider database is not being constructed for
the first time, but rather is being updated, then the URLs
from the existing database 147 are submitted 550 to the
second crawling stage 560 as well. Duplicates between the
previous database submissions 550 and the latest web crawl
submissions 540 are detected and removed (not shown).
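The text does not show how duplicates are detected. One simple possibility, assuming duplicates are identified by exact URL string match, is an order-preserving merge; the function name is hypothetical.

```python
from typing import Iterable, List


def merge_submissions(existing_urls: Iterable[str],
                      crawl_urls: Iterable[str]) -> List[str]:
    """Combine URLs from the existing database (550) and the latest crawl
    (540), keeping the first occurrence of each URL.

    Assumption: two submissions are duplicates when their URL strings match
    exactly.
    """
    seen = set()
    merged: List[str] = []
    for url in list(existing_urls) + list(crawl_urls):
        if url not in seen:
            seen.add(url)
            merged.append(url)
    return merged
```

Exact matching only works well if both sources store URLs in the same normalized form; otherwise two spellings of the same address would both survive.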

The second crawling stage 560 shown in FIG. 5 typically 
requires substantially less time than the first crawling stage 
510, as the number of web pages involved is considerably 
smaller. The results of the second web crawling stage are 
passed through a second page analyzing stage 570, wherein
product scores are generated anew. In a second filtering
stage 580, pages failing to satisfy the filter are once again
discarded 582. Those pages satisfying the second filtering
stage 580 are passed in step 590 to the index tool 164 for
further processing.
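The overall two-stage flow of FIG. 5 might be sketched as below. This is a simplification under stated assumptions: `fetch` and `score` are hypothetical callables standing in for the web crawler 160 and the page analyzer, the first stage is given a fixed URL list rather than discovering pages by following links, and both filters use the same threshold.

```python
from typing import Callable, Dict, Iterable


def run_stage(urls: Iterable[str],
              fetch: Callable[[str], str],
              score: Callable[[str], float],
              threshold: float = 30) -> Dict[str, float]:
    """Crawl, score, and filter one stage; return retained URL -> score."""
    retained: Dict[str, float] = {}
    for url in urls:
        page_score = score(fetch(url))
        if page_score >= threshold:
            retained[url] = page_score
    return retained


def two_stage_pipeline(seed_urls: Iterable[str],
                       fetch: Callable[[str], str],
                       score: Callable[[str], float],
                       existing_urls: Iterable[str] = ()) -> Dict[str, float]:
    """Run both stages and return the pages to pass to the index tool."""
    # First stage: crawl 510, analyze 520, filter 530.
    first = run_stage(seed_urls, fetch, score)
    # Second-stage input: retained URLs (540) plus existing database URLs
    # (550), with duplicates removed while preserving order.
    second_input = list(dict.fromkeys(list(first) + list(existing_urls)))
    # Second stage: crawl 560, analyze 570, filter 580.
    return run_stage(second_input, fetch, score)
```

Re-scoring in the second stage means a page whose content changed between crawls can still be discarded before indexing.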

The second filtering stage 580 preferably uses the same 
criteria as the first filtering stage 530. In an alternative 
embodiment, the second filtering stage 580 may have either 
more or less discriminating criteria than the first filtering 
stage 530.

3. Construction of the Product Spider Database 

The pages retained after the second filtering stage 580 
shown in FIG. 5 are passed to an indexing stage 590 wherein 
the index tool 164 creates the Product Spider database 147, 
fully text indexed by keyword 166. A given web page will
contain multiple index keywords distributed throughout its
text. The index tool 164 converts the information from a
form organized by URL into a form organized by keyword.
Schematically, the index tool 164 reorganizes the set of
multiple pages ($\text{Page}_m$, where $m = 1$ to $M$) containing
multiple keywords ($\text{Word}_n$, where $n = 1$ to $N$) such that

$\text{Page}_1(\Sigma_n \text{Word}_n), \text{Page}_2(\Sigma_n \text{Word}_n), \ldots, \text{Page}_M(\Sigma_n \text{Word}_n)$

is converted into

$\text{Word}_1(\Sigma_m \text{Page}_m), \text{Word}_2(\Sigma_m \text{Page}_m), \ldots, \text{Word}_N(\Sigma_m \text{Page}_m)$.
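Reorganizing data from a form keyed by URL into a form keyed by keyword is commonly called building an inverted index. A minimal Python sketch follows; the tokenization rule (lowercased runs of letters and digits) and the function name are assumptions, since the text does not specify how the index tool 164 extracts keywords.

```python
import re
from collections import defaultdict
from typing import Dict, List


def build_inverted_index(pages: Dict[str, str]) -> Dict[str, List[str]]:
    """Convert {url: page text} into {keyword: [urls containing it]}."""
    index: Dict[str, List[str]] = defaultdict(list)
    for url, text in pages.items():
        # Assumed tokenization: lowercase alphanumeric runs.
        # set() ensures each page is listed at most once per keyword.
        for word in sorted(set(re.findall(r"[a-z0-9]+", text.lower()))):
            index[word].append(url)
    return dict(index)


# Hypothetical pages, for illustration only.
example_pages = {
    "http://example.com/shoes": "Red shoes on sale",
    "http://example.com/hats": "Red hat",
}
example_index = build_inverted_index(example_pages)
```

Looking up a query term is then a single dictionary access instead of a scan over every page.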

As shown in FIG. 1, the database 147 includes, for each
keyword 166, one or more web page addresses 167 with
corresponding titles 168, squibs 169, and product scores
170. All of the product scores will necessarily equal or
exceed thirty in the preferred embodiment due to the second
filtering stage 580.

The web page addresses 167 stored in the Product Spider 
database 147 are preferably "canonicalized" URLs. URLs
often include one or more strings of characters appended to
the addressing information that specify, for example, a 
particular user ID, session ID, or transaction ID. These 
characters are not needed for accessing the web page, and 
are thus preferably discarded, resulting in a "canonical" 
URL for inclusion in the Product Spider database 147. 
Techniques for canonicalizing URLs are well known in the 
art. 
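As one illustration of canonicalization, the sketch below strips query parameters that carry user, session, or transaction identifiers. The parameter names in `TRANSIENT_PARAMS` are hypothetical; real sites use many different names, and some embed such identifiers in the path rather than the query string, which this sketch does not handle.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names; the text does not list specific ones.
TRANSIENT_PARAMS = {"sessionid", "sid", "userid", "transactionid"}


def canonicalize(url: str) -> str:
    """Return the URL with transient identifier parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRANSIENT_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
```

For example, under these assumptions `canonicalize("http://example.com/item?id=42&sessionid=abc123")` yields `http://example.com/item?id=42`, so two visits to the same page with different session IDs map to one database entry.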

The title 168 entry of the database 147 is preferably 
duplicated directly from the title used for the web page, as 
identified by the appropriate HTML tags. If a web page has 
an inappropriate title, or is missing a title, a new title is
inserted into the database 147 as needed on a case by case 
basis. 
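Reading the title from the page's HTML tags might look like the following sketch, which uses Python's standard `html.parser` module. The class and function names are hypothetical. A missing or empty title returns `None`, so the caller can flag the page for the case-by-case handling mentioned above.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TitleParser(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html: str) -> Optional[str]:
    """Return the whitespace-normalized page title, or None if absent."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    return title or None
```

Deciding whether an existing title is "inappropriate" is a judgment the text leaves to manual review, so this sketch does not attempt it.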

The squib 169 entry of the database 147 is generated
automatically by the index tool 164. The squib corresponds
to the initial series of words on a web page, up to a preset
number of characters set at about two hundred. In another
embodiment, the squib displays relevant text extracted from 
the web page corresponding to the products offered for sale 
on the web page. 
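A squib built from the initial words of a page could be generated as in the sketch below. It assumes the input is the page's visible text with markup already removed, and that the squib ends on a whole word rather than being cut mid-word; the text specifies neither detail. The function name is hypothetical.

```python
def make_squib(page_text: str, max_chars: int = 200) -> str:
    """Return the leading words of the page, up to max_chars characters.

    Words are joined by single spaces; the squib never ends mid-word.
    """
    squib = ""
    for word in page_text.split():
        candidate = word if not squib else squib + " " + word
        if len(candidate) > max_chars:
            break
        squib = candidate
    return squib
```

If the very first word is longer than `max_chars`, the result is an empty string; a production version would need to decide how to handle that edge case.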

The process illustrated in FIG. 5 may be used to update 
the Product Spider database 147 as often as desired. In a 
preferred embodiment, the Product Spider database 147 is 
updated every week, more preferably the database is updated 
every three or four days, and even more preferably it is 
updated every day. 

As indicated above, the Product Spider database 147 may 
alternatively be constructed without storing the product 
scores for each page. In one embodiment, for example, the 
database comprises only pages having a product score 
satisfying predetermined criteria, for example, requiring the 
product score to equal or exceed thirty (as in the filtering 
steps 530, 580 of FIG. 5). In another alternative 
embodiment, the database comprises multiple indexed tables 
created without storing the product scores, wherein each 
table is constructed from web pages having a product score 
satisfying unique criteria, for example, four separate indexed 
tables containing pages having product scores from 20-40, 
40-60, 60-80, and 80-100, respectively.

In another embodiment, the Product Spider database 147 
consists of multiple indexed tables, wherein each table is 
constructed from web pages that are distinguishable on the 
basis of some aspect of product offerings (ascertained from 
parsing the web pages) unrelated to product scores. In one 
embodiment, for example, the database 147 consists of 
separate tables for different categories of goods (e.g., books, 
music, videos, electronics, software, and toys). In another 
embodiment, a separate table is used for products unsuitable 
for children. In still another embodiment, different tables are 
constructed for web sites written in different languages 
(English, Japanese, German, etc.). In yet another 
embodiment, different tables are constructed for on-line and 
off-line product offerings. Under these embodiments, the 
page analyzer steps 520, 570 include searching for character 
strings judged to be associated with the various predefined 
categories. 
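Searching for character strings associated with predefined categories might be sketched as follows. The categories come from the examples in the text, but every string in `CATEGORY_STRINGS` is a hypothetical placeholder; the text does not disclose which strings the page analyzer actually uses.

```python
from typing import Dict, List

# Hypothetical indicator strings per category, for illustration only.
CATEGORY_STRINGS: Dict[str, List[str]] = {
    "books": ["isbn", "paperback", "hardcover"],
    "music": ["audio cd", "album"],
    "toys": ["ages 3 and up", "toy"],
}


def detect_categories(page_text: str) -> List[str]:
    """Return the sorted categories whose indicator strings appear in the text."""
    text = page_text.lower()
    return sorted(
        category
        for category, needles in CATEGORY_STRINGS.items()
        if any(needle in text for needle in needles)
    )
```

A page can match several categories at once, in which case it would be entered in each corresponding table.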

By constructing the Product Spider database 147 out of 
different tables having distinguishing characteristics, or 
retaining the equivalent information within one big table, the 
user is capable of conducting a more refined search within 
the Product Spider database 147. In one embodiment, for 
example, the Related Products hypertext link 380 is replaced 
by a pulldown menu comprising different categories
corresponding to the distinctions retained within the Product
Spider database 147 (e.g., books, music, video, and toys 
categories, on-line versus off-line offerings, goods versus 
services, etc.). 






